[WIP] Fix infinite loop when computing parse tree for recursive nullable grammars #874

hudson-ai · 2024-06-02T04:29:02Z

Computing the parse tree (needed for getting captures) previously caused an infinite loop if the grammar has loops of nullable rules, where a nullable rule is any rule where every symbol on the RHS is nullable.

E.g., the grammar

  A -> A C
  A -> B
  A ->
  B -> A
  C -> 'x'

has nullable rules

  A -> B
  A ->
  B -> A

and there is a loop in these rules formed by A -> B and B -> A.

For a deeper discussion on this, see section "One Last Trap" of https://loup-vaillant.fr/tutorials/earley-parsing/parser

This PR solves the issue by keeping a set of visited nodes when computing the parse tree (or getting captures) and skipping any node that has already been seen. AFAIK, the only way this can happen is when we have these nullable loops, but admittedly I still have a lot to learn about the parser (so more eyes on this would be really helpful!)

This closes #863

hudson-ai · 2024-06-02T04:30:15Z

guidance/_parser.py

        while stack:
            item, byte_pos = stack.pop()
+            if (item, byte_pos) in seen:


Maybe just check item

hudson-ai · 2024-06-02T04:30:35Z

guidance/_parser.py

        while stack:
            pos, item = stack.pop()
-
+            if (pos, item) in seen:


Same here -- just need to check item?

hudson-ai · 2024-06-02T04:31:51Z

tests/test_parser.py

+    loop if not handled correctly
+    """
+
+    @pytest.mark.timeout(5)


Not in love with checking for an infinite loop by introducing a timeout, but it wasn't straight-forward to catch this otherwise. Anyone have a suggestion?

5 seconds should be more than ample to compute the parse tree...

I spent some time thinking about this, and couldn't come up with anything better 🤷

codecov-commenter · 2024-06-02T04:35:54Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.33%. Comparing base (15ff21d) to head (867adbe).
Report is 36 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #874      +/-   ##
==========================================
+ Coverage   57.02%   60.33%   +3.30%     
==========================================
  Files          63       63              
  Lines        4575     4583       +8     
==========================================
+ Hits         2609     2765     +156     
+ Misses       1966     1818     -148

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

hudson-ai · 2024-06-02T04:38:19Z

tests/test_parser.py

+            parser.consume_byte(b"x")
+            captures, _ = parser.get_captures()
+            assert mock.call_count == 1
+            assert captures == {"B": b"", "A": b"x"}


Quick note here: I'm not sure what the semantics of partial captures are supposed to be, but this test is consistent with the existing behavior without these nullable loops

riedgar-ms

Thanks for digging into this!

riedgar-ms · 2024-06-02T17:09:04Z

tests/test_parser.py

+        # Test that computing the parse tree doesn't hang
+        parser.parse_tree()
+        # Test that getting captures doesn't hang
+        parser.get_captures()


Can you assert anything about the last two operations?

Also, if either of them can hang on an infinite loop, would it be worth having two tests, to identify which one hit the loop (although that may be more than needed).

I reserved making asserts for some of the other tests, but I'll have a think 😄

get_captures actually calls parse_tree under the hood BUT it can also hang even if parse_tree doesn't. I can maybe put another mock in here to disentangle them a bit..?

I wouldn't worry about injecting another mock. I was just wondering if there was any visible state change which could be asserted after each step?

riedgar-ms · 2024-06-02T17:09:47Z

tests/test_parser.py

+        A -> A
+        A ->
+
+        Loop occurs because `A -> A` is a nullable rule


What happens with the JSON 'linked list' here? Doesn't that come down to something like this?

Good question. This is pretty subtle and took me a while to understand.

The grammar

A -> A 'x' A ->

is totally fine because the first rule isn't nullable by the definition I gave above because 'x' isn't nullable -- only A is.

I.e.

zero_or_more('x') is fine; zero_or_more(optional('x')) isn't.

The latter looks like

A -> A B A -> B -> 'x' B ->

Linked lists look more like the former.

I suppose the fact that the existing JSON tests passed is a good sign, but I'm still trying to work out these grammars.

riedgar-ms · 2024-06-02T17:10:52Z

tests/test_parser.py

+        # Test that computing the parse tree doesn't hang
+        parser.parse_tree()
+        # Test that getting captures doesn't hang
+        parser.get_captures()


For all of these tests, it would be nice if there was something to assert about the state of the parser at the end.

Agreed. Will keep thinking about this!

Working on making some assertions here, I became dissatisfied with the fact that my fix doesn't actually prevent infinite parse trees like A -> A -> ...; it only prevents looping over these parse trees forever. The behavior is right in that we're still able to correctly get captures, so maybe this is fine. But marking this as a draft in the meantime while I see if I can get a bit closer to the root of the issue.

hudson-ai · 2024-06-05T16:50:24Z

Currently, my fix is to avoid infinitely recursing over children, but I suspect that it might be nicer to prevent accepting parses that have infinitely looping children to begin with. That being said, the behavior seems correct (in that captures are correctly being... captured).

@slundberg I would really appreciate some input from you on this as well as what the correct semantics of partial captures should be (I tried to preserve existing behavior, but I'm not entirely clear on whether that behavior was the intended one).

hudson-ai added 7 commits June 1, 2024 17:07

First pass at a fix for guidance-ai#863

f17e434

Add fix to _record_captures_from_root

de33cf5

Add tests (+ pytest-timeout test dependency)

5617cf2

Test both partial and full captures

e04d693

Merge remote-tracking branch 'upstream/main' into issue_863

ca2fca9

comments

40533bb

One more test to reflect Loup Vaillant's tutorial

3f2a3ef

hudson-ai requested review from Harsha-Nori, riedgar-ms and paulbkoch June 2, 2024 04:29

hudson-ai commented Jun 2, 2024

View reviewed changes

riedgar-ms reviewed Jun 2, 2024

View reviewed changes

hudson-ai changed the title ~~Fix infinite loop when computing parse tree for recursive nullable grammars~~ [WIP] Fix infinite loop when computing parse tree for recursive nullable grammars Jun 3, 2024

hudson-ai added 2 commits June 3, 2024 15:30

Add other option for B to better track capture state

7c64496

keep parse_tree and get_captures tests separate

867adbe

hudson-ai mentioned this pull request Sep 16, 2024

Test recursive nullable grammars #1026

Open

hudson-ai closed this Sep 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Fix infinite loop when computing parse tree for recursive nullable grammars #874

[WIP] Fix infinite loop when computing parse tree for recursive nullable grammars #874

hudson-ai commented Jun 2, 2024

hudson-ai Jun 2, 2024 •

edited

Loading

hudson-ai Jun 2, 2024

hudson-ai Jun 2, 2024

Harsha-Nori Jun 9, 2024

codecov-commenter commented Jun 2, 2024 •

edited

Loading

hudson-ai Jun 2, 2024

riedgar-ms left a comment

riedgar-ms Jun 2, 2024

hudson-ai Jun 2, 2024

riedgar-ms Jun 2, 2024

riedgar-ms Jun 2, 2024

hudson-ai Jun 2, 2024 •

edited

Loading

riedgar-ms Jun 2, 2024

riedgar-ms Jun 2, 2024

hudson-ai Jun 2, 2024

hudson-ai Jun 3, 2024

hudson-ai commented Jun 5, 2024

[WIP] Fix infinite loop when computing parse tree for recursive nullable grammars #874

[WIP] Fix infinite loop when computing parse tree for recursive nullable grammars #874

Conversation

hudson-ai commented Jun 2, 2024

hudson-ai Jun 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Jun 2, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

riedgar-ms left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hudson-ai Jun 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hudson-ai commented Jun 5, 2024

hudson-ai Jun 2, 2024 •

edited

Loading

codecov-commenter commented Jun 2, 2024 •

edited

Loading

hudson-ai Jun 2, 2024 •

edited

Loading